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This article considerss a Newton-like method already used by several au- 
thors but which has not been thouroughly studied yet. We call it the robust- 
variance scoring (RVS) algorithm because the main version of the algorithm 
that we consider replaces minus the Hessian of the loglikelihood used in the 
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Newton-Raphson algorithm by a matrix G which is an estimate of the vari- 
ance of the score under the true probability, which uses only the individual 
scores. Thus an iteration of this algorithm requires much less computations 
than an iteration of the Newton-Raphson algorithm. Moreover this estimate 
of the variance of the score estimates the information matrix at maximum. 
We have also studied a convergence criterion which has the nice interpretation 
of estimating the ratio of the approximation error over the statistical error; 
thus it can be used for stopping the iterative process whatever the problem. 
A simulation study confirms that the RVS algorithm is faster than the Mar- 
quardt algorithm (a robust version of the Newton-Raphson algorithm); this 
happens because the number of iterations needed by the RVS algorithm is 
barely larger than that needed by the Marquardt algorithm while the compu- 
tation time for each iteration is much shorter. Also the coverage rates using 
the matrix G are satisfactory. 

Keywords: likelihood, Newton algorithm, convergence tests, score, ro- 
bustness. 



1 INTRODUCTION 

The Newton-Raphson method is very efficient in maximizing the likelihood 
in problems of moderate complexity such as the Cox model for instance. 
However, the development of more and more complex statistical models in 
particular random-effect models and semi-parametric models together with 
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more complex observation schemes is a challenge for likelihood inference. In 
such complex situations we may not be in a convex optimization problem and 
we need more robust methods than the Newton- Raphson methods; moreover 
the likelihood is difficult to compute, most often involving non-analytical 
integrals, and analytical derivatives are not available. A robust version of the 
Newton-Raphson method can be developed using the idea of the Marquardt 
(or Levenberg-Marquardt) algorithm (Marquardt, 1963), but this algorithm 
needs the second derivatives of the loglikelihood (the Hessian). 

These difficulties have prompted the development of alternative algo- 
rithms. The most often used for likelihood inference is the EM algorithm 
(Dempster, Laird and Rubin, 1977). Many researchers also have adopted a 
Bayesian framework in which the MCMC algorithm can be used (Gilks et 
al., 1996). There are also applications of the MCMC algorithm for likelihood 
maximization using the EM algorithm (Wei and Tanner, 1990). However 
both EM and the Bayesian approaches may be very time-consuming in com- 
plex problems, although it is an active research field to improve the algo- 
rithms using these approaches: for instance Kuhn and Lavielle (2005) have 
proposed a fast version of the EM algorithm. Newton-like methods still have 
trumps in their hands. The first thing to note is that numerical derivatives 
can be easily computed; there are also semi-analytical ways of computing 
the derivatives in cases where analytical derivatives are available for a "full" 
problem and this has been used for instance by Hedeker and Gibbons (1994) 
and Commenges and Rondeau (2006). Advantages of Newton-like methods 
are the relatively small number of iterations for converging, the existence of 
good convergence tests and the inference that can be done using an estima- 
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tor of the information matrix at maximum. However the computation of the 
Hessian needed in the Newton-Raphson and Marquardt algorithm, even if 
feasible, may be time-consuming and inaccurate. 

The aim of this paper is to study a Newton-like algorithm proposed by 
Berndt et al. (1974); the idea can be traced back to Bock (1970) and the 
algorithm can be viewed as an extension of the Gauss-Newton algorithm. It 
has been extensively used in the psychometric and econometrics literature 
where it is called the BHHH algorithm, and maybe less in the biostatistical 
literature for which we may cite Hedeker and Gibbons (1994), Todem, Kim 
and Lesaffre (2006) and Ruppert (2005). We feel that it has still not been 
studied thoroughly: in particular far from the maximum the definition of 
the scoring matrix should be modified (Meilijsson, 1989). This algorithm 
estimates the information matrix at maximum by an estimator of the variance 
of the score using individual scores and far from the maximum the matrix 
is still an estimate of the variance of the score; for this reason we call it 
the Robust-variance scoring (RVS) algorithm. This algorithm requires much 
less computations than the Marquardt algorithm while retaining all of its 
advantages. Moreover we argue that a very good convergence test is available 
for this algorithm and that robust inference can be naturally done in case of 
mis-specified models. 

The organization of the paper is as follows. In section 2 we present some 
facts that are valid for a class of Newton-like algorithms; in particular we 
propose a convergence test which has an intrinsic statistical interpretation. 
In section 3 we present the RVS algorithm in detail and we show that robust 
inference can be done without computing the Hessian. In section 4 we make 
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some numerical experiments using a random-effect model with a complex 
observation scheme. Section 5 concludes. 

2 NEWTON-LIKE METHODS FOR LIKE- 
LIHOOD MAXIMIZATION 

2.1 NOTATIONS 

We assume a model {Pe}eeo, where 6 is a "nice" open subset of !R m so 
that this defines a regular parametric model (Rothenberg, 1971; Bickel et 
al, 1993). Let C e be the likelihood for the observation depicted by a a- 
field O; we shall assume that the observation is made of n independent 
parts so that O = \/Oi where the Oi are independent cx-fields, for instance 
the Oi are generated by independent random variables Yi. We note L\ = 
log^Q., the log-likelihood for observation i\ L 9 = J27=i^i the total log- 
likelihood. We shall define the maximum likelihood estimator as the solution 
of minimizing — L e ; this has the advantage of removing some minus signs 
in many equations; moreover most optimization problems are presented as 
minimization problems and the MLE can be viewed as minimizing a Kulback- 
Leibler divergence. We assume that the log-likelihood is twice-differentiable 
and we note Uf = U 8 = £?=i Uf and H(6) = The information 

matrix is 1(6) = E g [H(6)] = E e [U e (U e ) T }. We assume that the Lfs are 
independently identically distributed (iid) (the iid case is more general than 
may appear at first sight): then we have 1(9) = nl(9), where 1(9) does not 
depend on n. We assume that 1(9) is positive definite for all 9 G 0. For 
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the statistician there is a true probability measure P*; if P* belongs to the 
model there is a value 9* G such that P* = P^. We shall assume first 
that this is the case (we shall consider later that this is not the case, that 
is that the model is mis-specified). Under these conditions 9* is identifiable 
and the maximum likelihood estimator 9 is consistent and asymptotically 
efficient and we have n~ 1 / 2 (9 - 9,) -> iV(0, J^,)" 1 ). 

2.2 A CLASS OF NEWTON-LIKE METHODS 

We shall consider a class of Newton-like methods for likelihood maximization. 
The Newton algorithm is defined by the iteration 

9k+i = 9k~ H~ l {9 k )U{9 k ). 

This is the most efficient method for minimizing or maximizing a function 
when it is not too far from quadratic. When the function to maximize is 
the log-likelihood an additional advantage of the method is that H{9) is an 
estimator of 1(9* ) allowing to construct confidence intervals and tests for 9. 
In complex problems however this method has two main drawbacks: it may 
fail, in particular because the Hessian may not be positive-definite; it may 
be time consuming if the number of parameters is large and the likelihood 
difficult to compute. So we are led to consider Newton-like methods defined 
by iterations of the type: 

9 k+1 = 9 k -a k G~\9 k )U{9 k ), 

where a k is the step length and is found by line-search and where G(9 k ) has 
two main properties: 
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i) it is positive-definite for all 9; 

ii) it approaches H(9k) and I (9k) when 9k approaches 9. 
Another property is important: 

hi) if the model is re-parametrized in rj^ = A9k + a the following relation 
holds between G and the matrix G v for the transformed problem: G(9k) = 
A T G v ( V k)A. 

Moreover the ease of computation of G(9k) will be a major criterion for 
choosing among the methods which are successful in converging. 

Property (i) ensures that the algorithm is a descent method and thus 
will converge; property (ii) ensures that convergence near the maximum will 
be approximately as fast as with the Newton-Raphson algorithm and that 
inference about 9 will be possible by estimating 1(9*) by G(9). Property 
(iii) ensures that the algorithm is invariant by affine transformation of the 
parameters, that is we should not have problems with bad conditioned G, 
as long as G is non-singular and the numerical computations sufficiently 
accurate (see Fletcher, 1987). The three properties will also help finding a 
good convergence test. 

2.3 THREE NEWTON-LIKE METHODS 

Let us consider three Newton- like algorithms: the Marquardt, the Fisher- 
scoring and the RVS algorithms. The Marquardt algorithm (Marquardt, 
1963) uses a matrix G obtained by adding to H a positive-definite matrix. 
The simplest form is G(9k) = H(9k) + \kld where Id is the identity matrix 
and Afc is adaptively tuned so as to make the matrix G(9k) positive-definite, 
ensuring to G(9k) the property (i); if H is positive definite around the max- 
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imum, smaller and smaller A are used so that G for Marquardt has property 
(ii). The Marquardt algorithm has been found to be efficient in many sta- 
tistical problems (for instance Joly et al., 2002; Alioum et al., 2005; Proust 
and Jacqmin-Gadda, 2005), although this is a general method. The com- 
putational load for computing G(6k) is about the same as for computing 
H(9 k ). 

The Fisher-scoring algorithm uses G(6k) = I (0k). With our assumptions, 
we have property (i) and obviously property (ii). This choice also enjoys 
property (iii). The Fisher scoring algorithm has been used in particular in 
generalized linear models where I (6k) is generally easy to compute; it is an 
efficient and robust algorithm (McCullagh and Nelder, 1999). The algorithm 
is specific to likelihood maximization. In some particular cases I (6k) is easy to 
compute but this is not the case in general. In general problems, when there 
is no analytical form of the information matrix, the Fisher-scoring algorithm 
seems difficult to implement. 

Berndt et al. (1974) (see also Hedeker and Gibbons, 1994) used G(6k) = 
Yh=i Ui(d k )Uf(dk). Todem, Kim and Lesaffre (2006) used G(6 k ) = E*Li Ui(6 k )U?(6 k )- 
n~ 1 U(6k)U T (6k) , a correction proposed by Meilijsson (1989). However none 
of these authors have thoroughly studied this type of algorithm. Note that 
the two choices are equivalent near the maximum since U(6) = but not far 
from the maximum. Both choices enjoy the three properties (i), (ii) and (iii), 
although it may happen that (ii) is not satisfied with the second choice. The 
algorithm converged satisfactorily in the statistical problems treated by these 
authors. The main advantage of this type of choice is that G(6k) is much 
less computationally demanding than H(6k) since it requires only computing 
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the £/j(0fc)'s, that is only the first derivatives of the individual log-likelihoods. 
This advantage becomes huge when the number of parameters is large. This 
is the type of algorithm that we will study in section 4 under the name of 
"Robust-variance scoring" (RVS) algorithm. 

3 CONVERGENCE TESTS FOR NEWTON- 
LIKE METHODS 

3.1 CONVENTIONAL CONVERGENCE TESTS 

Using a good convergence test is essential in the efficiency of any iterative 
method. We shall call "convergence test" a method leading to the decision 
of stopping: generally it will take the form Ck < c, where Ck will be called 
"stopping criterion" and c stopping value. Stopping criteria based on the 
displacement in the parameter space ||0fc+i — Qk\\ or in the log-likelihood 
lfk+i _ ifk are no t satisfactory because small displacements may occur if 
the algorithm fails to find a good direction. A better criterion is based on 
||£/(#fc)|| 2 because a necessary condition for the maximum is U(9) = 0. It 
is nice to have a stopping criterion which is invariant by linear transform 
of 9; such a stopping criterion is based on U(9k)G~ 1 (9k)U(9k), where G has 
property (iii) (Fletcher, 1987: Dennis and Schnabel, 1996). It still remains 
to fix a value at which to stop. 

We can say a little more on the criterion U{9f t )G~ l {9f i .)U{9f t ) in the context 
of log-likelihood maximization. Dennis and Schnabel (1996) have noted that 
this stopping criterion is not invariant by a change of scale of the function 
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to maximize; however the log-likelihood does not have to be rescaled: it 
is defined up to an additive constant (which corresponds to the choice of 
the reference probability when defining the likelihood) but it can not be 
multiplied by an arbitrary constant without changing its meaning. Another 
property, also specific to likelihood maximization, is that the criterion is 
asymptotically invariant near the maximum by any one-to-one transforms. 
This follows from Delta-calcul. 

3.2 AN OPTIMAL CONVERGENCE TEST 

Here we shall develop an "optimal" stopping criterion for maximizing the 
likelihood. It is good to have a stopping criterion which has the invariance 
properties mentioned above. However it is still difficult to choose a value 
at which to stop. A way to solve the problem is to devise a criterion which 
can be interpreted and which has the same meaning for different problems. 
The proposed solution is to stop when the approximation error is small with 
regard to the statistical error. The approximation error for a given 9 can be 
measured by d(9k, 9), where d(., .) is a distance defined on 3? m . The statistical 
error can be defined as Eg f [d(9,9*)}, so that the criterion is 

c d(9 k ,9) 

and it is natural to take the stopping value c between 10 -2 and 10 -4 , whatever 
the problem. Let us define the distance d(x, y) by a norm of x — y of the 
type: d(x,y) = {x — y) T M(x — y). It is desirable that the distance be 
invariant by an affine transformation of the parameters, so it is natural to 
take M — G, the matrix used in our algorithm and which has property 
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(iii). Near the maximum we have by property (ii) that G(9k) ~ I{0*) ~ !{&)■ 
Thus we have E e .[d(0, 0*)] « E0j(0-0*) T /(0*)(0-0*)]. Using the asymptotic 
property of the right-hand term is the expectation of a chi-squared with 
m degrees of freedom; thus [d(6, ~ m. As for the numerator, using 
the first order Taylor expansion U{9^) ~ I(9)(9k — 6) it can be seen that we 
have d(9k,9) ~ U(6k)G~ l (6k)U{6k)- Thus we have an approximation of the 
criterion Cfc as: 

This is a scaled version of the invariant criterion mentioned in the previous 
paragraph, but now we have an interpretation of it and we can choose a 
stopping value which does not need to depend on the problem at hand. 

A good stopping criterion must not take small values far from the maxi- 
mum, a criticism that has been addressed to other criteria in section 3.1. It 
is difficult to prove that Ck has this property for general G because the only 
properties that we have specified and which are relevant far from the maxi- 
mum, (i) and (iii), are not sufficient to impose large values of C. We shall see 
in section 4.3 that the result is easy to obtain for the RVS algorithm. Also 
in the appendix we put forward the idea that in certain difficult problems we 
might decide to stop the iterations before full convergence. 
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4 THE ROBUST- VARIANCE SCORING (RVS) 
ALGORITHM 



4.1 DESCRIPTION OF THE BASIC ALGORITHM 

Let us study the algorithm based on G(6 k ) = YJU U i {e k )Uf{e k )-r] k rr 1 U{e k )U T {e k ). 
Hedeker and Gibbons (1994) chose r] k = 0. This choice may lead to inefficient 
directions; it can be seen that it leads to small displacements for the com- 
ponents having larges values of U. Todem, Kim and Lesaffre (2006) chose 
rj k — 1. The above problem does not appear with this choice. Moreover 
G(9 k ) can then be interpreted as an estimator of the variances of U under 
the true model, and this warrants the name of "Robust- variance scoring algo- 
rithm" . Indeed, under certain assumptions, the law of large number gives us: 
n- l G(6 k ) -> n-VarpJf/^)]. Note that var P J[/(0 fe )] ^ vw 9h [U{0 h )] = I(6 k ), 
so that the RVS algorithm is different from the Fisher-scoring algorithm, even 
asymptotically. 

One possible problem with the choice r] k = 1 is that it is not sure that 
G{6k) is not singular. One possibility to ensure property (i) is to use r\ k 
slightly smaller than 1, possibly in an adaptive way as in the Marquardt 
algorithm; an adaptive choice of r] k can also be done for optimizing the 
direction of search. 

4.2 INFERENCE USING G 

We may use G(6) as an estimator of varp„ [[/(#*)]. This may be slighlty 
biased when the number of parameters is large relatively to n and one might 
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consider reducing the bias by using -^z^G(9). If the model is well specified we 
have varp t [{/(#*)] = varg, [{/(#*)] = 1(6*), so that tests or confidence intervals 
based on G(9) are asymptotically equivalent to tests based on H(9). So we 
may estimate vare t (^) by G^ 1 (9). 

If the model is mis-specified (that is P* does not belong to {Pg}g £ @), 9* has 
the meaning that Pe t is the probability measure of {Pe}e&e which is the clos- 
est to P* with respect to Kullback-Leibler divergence (Burnham and Ander- 
son, 1998) and 9 is consistent for 9*. Royall and Tsou (2003) call 9* the object 
of inference: for inference to be relevant with a mis-specified model the object 
of inference must have the same meaning in the true probability P*. When the 
model is mis-specified we no longer have varp„ [U (9*)] = 1(9*); inference based 
on 1(9*) is not valid while robust inference based on G(9) in the spirit of Roy- 
all (1986) is still feasible. The variance of the estimators of the parameters 
can be obtained by the sandwich estimator: vaxp^(9) = H^ 1 (9)G(9)H~ l (9). 
However this estimator has the disadvantage of requiring the computation 
of the Hessian while the algorithm itself avoided it. In some cases this is 
feasible to compute it once for the computation of confidence intervals. 

In the case where it is not feasible to compute the Hessian it should be 
possible to construct a robust confidence interval using the following idea. For 
any value 9 a score test of = 9" can be done: the null hypothesis is rejected 
if U T (9)G^ 1 (9)U(9) > c a , where we use the Xm asymptotic distribution of 
the statistic and c a is the critical value for constructing a size-a test based on 
this distribution. Thus the set {9 : U T (9)G~ l (9)U (9) < c a } is a confidence 
ellipsoid at level a. It seems possible to derive algorithms for constructing 
confidence interval along these lines. 
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4.3 CONVERGENCE TEST 

The convergence test described in section 3.2 can be applied to the RVS 
algorithm and we have seen that it enjoys a very nice interpretation near 
the maximum. We wish in addition that the probability of stopping before 
reaching the region of the maximum be very low. 

Note that with our assumption that 1(9) > there is a unique 9 such that 
Eg t [U(8)} = and this is 8*; in case where the model is well specified this 
value defines the true probability measure. This does not mean that there is 
a unique maximum of the likelihood for given observation but the probability 
of several maxima tends toward zero. Suppose that we wish to test the hy- 
pothesis E dit [U(9k)] = 0; a score test statistic is U(9 k ) T [yarp :t [U(9k)}}~ 1 U(9 k ); 
since G(9k) is an estimator of varp„ [U(9k)}, Ck can be considered as a test 
statistic for E<^ [U(9k)} =0. If c is the stopping value there is identity between 
the decision of not stopping and the decision of rejecting the null hypothesis. 
Thus our convergence test is a statistical test (outside of the region of the 
maximum). The null hypothesis is rejected if U(9k) T G(9k)~ 1 U(9k) > cm; 
as c is small the size of the test will be large. This is of course not a con- 
ventional testing situation and what we wish is that the power be large. If 
E ,[U(9 k )] ^ the distribution of U(9 k ) T G(9 k )~ 1 U(9 k ) is asymptotically a 
non-central chi-squared and the probability of rejecting the null hypothesis 
tends toward 1 (that is, the probability of stopping is close to zero for large 
n) . In fact even if we neglect the non-centrality term, the probability of stop- 
ping is very low with the choice c = 10~ 2 and m moderately large or large. 
For instance for m = 10 we have P*(C k < c) < P(xlo < = 2.51CT 8 . 
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4.4 SPEED 



The algorithm is useful for problems in which the computation of the likeli- 
hood is difficult and time-consuming and the derivatives of the loglikelihood 
can not be computed analytically. In that case the derivatives will be com- 
puted by finite difference (Overton, 1981) and the computational load for one 
iteration is approximately proportional to the number of likelihood evalua- 
tions for computing U(0k) and G(9k). For computing U(9k) one has to com- 
pute the Ui(6k), so that G(9k) is a by-product of the computation of U(9k) for 
the RVS algorithm. This computation takes 2m evaluations of the loglikeli- 
hood if one uses centered differences for approximating the derivative. This is 
to be compared to the Newton or Marquardt algorithm for which one has to 
compute H(9k) which requires the computation of m(m + 1)/2 second deriva- 
tives. Even if these second derivatives are computed with only one likelihood 
evaluation this leads to m(m + 5)/2 evaluations. Thus the ratio of com- 
putation required for one iteration of the RVS compared to the Marquardt 
algorithm is approximately 4/(m + 5) which takes values 0.26,0.16,0.11,0.08 
for m = 10, 20, 30, 40 respectively; that is for large problems the computa- 
tional burden for one iteration of the RVS algorithm is about ten times less 
than for the Marquardt algorithm. 

In terms of accuracy, iterating the numerical differentiation, necessary 
for the Marquardt algorithm, may lead to unacceptable loss of precision 
by cancellation error (Overton, 1981) although in some cases there is the 
possibility to make more precise computations (Commenges and Rondeau, 
2006). For the RVS algorithm the numerical differentiation is not iterated, 
diminishing the accuracy problem; in some cases it is also possible to avoid 
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the numerical differentiation. A problem may occur in the RVS algorithm if n 
is small because the rationale of the algorithm is based on the fact that G(6k) 
is an estimator of waipJ\U{9k)} using the law of large numbers; as mentioned 
by Song et al. (2005) this estimator may be unstable for small n jeopardizing 
both convergence of the algorithm and inference based on G(8). 

4.5 APPLICATION TO PENALIZED LIKELIHOOD 

Penalized likelihood is defined as: 

pl(6)=L e -J(6,K), (1) 

where k = = 1, . . . ,K) is a set of smoothing coefficients. Penalized 

likelihood is useful for obtaining smooth non-parametric estimators of func- 
tions (O'Sullivan, 1986; Joly et al., 2002), so that 9 may contain functions. 
However the solution is generally approximated on a basis of splines so that 
we are driven back to the parametric case, where part of the parameters are 
splines coefficients. 

For applying the RVS algorithm it is tempting to use the same formula as 
above replacing the t/j's by ^ i where pU = L e — rT x J(6, k). However for the 
choice rjk = 1 we then obtain the same matrix as for the ordinary likelihood 
because this matrix estimates var([7) and var(^) = var({7). If we wish to 
have an efficient direction near the maximum it is appropriate to use: 

i=l 

With this choice G{9 k ) ~ ^^(0k) near the maximum. It is generally easy to 
compute ^g^. 
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5 NUMERICAL EXPERIMENTATION 



We performed a simulation study to compare RVS with the Marquardt algo- 
rithm for computing maximum likelihood estimators of linear mixed models 
from left censored longitudinal data. Indeed left censoring of biological mea- 
sures (such as HIV-RNA) frequently arises due to the lower detection limit 
of the measurement tools. Jacqmin-Gadda et al. (2000) have shown that 
the contribution to the likelihood for one subject when some measures are 
left censored is the product of the multivariate Gaussian density for the 
completely observed measures and of a multivariate Gaussian distribution 
function for the censored measures given the observed ones. Computation 
of the distribution function for each subject requires a numerical integration 
of size equals to the number of censored measures. Thus, the computation 
time highly depends on the proportion of censored measures and thus on the 
threshold value. 

Data were generated according to either a linear model with an auto- 
regressive error structure or a linear model with random intercept and slope: 
Model AR : 

Yij = Po + ftitij + + fizXitij + Wij + eij 

with (3 = 4, Pi = —0.5, p 2 = —0.5, /3 3 = —0.1. The Gaussian variable 
has mean 0, and covariance cov(wij,wu) = al,exp(—5\tij — tu\) with a 2 ^ = 1, 
5 = 0.1. The variance of the independent Gaussian random error a\ was 1. 

Model RE : 

Yij = Po + Pitij + PiXi + P^Xitij + a,Qi + antij + 
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The two random effects and an have a centered Gaussian distribution 
with o"o = var(aoi) = 0.25, a\ = var(an) = 0.1 and cr i = cov(aoi,au) = 
—0.1. The rest of the model was unchanged. 

The response values less than the threshold were left censored. The 
threshold was either 1.0 or 2.0 leading respectively to about 25% and 45% of 
censored measures. 

For each model and each threshold, we generated 100 samples of 100 
subjects with a number of measures randomly selected according to a uniform 
distribution between 5 and 11. The covariate X had a Bernoulli distribution 
with P=0.5 and the time variable followed a U[0-6] distribution. 

The starting values for the estimation algorithms were the maximum 
likelihood estimates obtained by imputing the threshold for the censored 
measures. Convergence was reached when the stopping criteria Ck defined 
in section 3.2 reached the stopping value 10 -4 in less than 30 iterations for 
the Marquardt algorithm and in less than 30*(m+5)/4 for RVS (with m=7 
or 8 according to the model). Results were not different when increasing 
the maximum number of iterations. In both algorithms, derivatives were 
computed by finite difference and a line-search was used to select the step 
length a.k only when the updated parameters did not improve the likelihood 
because this strategy was found to be efficient. 

Convergence results are displayed in Table 1. The proportion of successful 
convergence was always higher for the RVS algorithm and the convergence 
time was clearly lower even if the number of iterations required to reach 
convergence was slightly higher. The ratio of computation time required for 
one iteration of RVS compared to the Marquardt algorithm was very close to 
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the expected value 4/(m+5) and even slightly more favorable to RVS than 
expected (0.29 vs 0.33 expected for the AR model) because the Marquardt 
algorithm requires more computations in addition to computations of the 
derivatives (diagonal inflation and maybe more often line search for the step 
length). 

When both algorithms converged the estimates were nearly identical: the 
mean sum of squared differences between the estimates was less than 3.10 -6 
in the four situations. Table 2 presents the mean asymptotic variance, the 
sample variance and the coverage rate of the 95% confidence interval of the 
estimates for the two models and the two algorithms when the censoring 
threshold was 2.0. This shows that the variances of the estimators were 
correctly estimated by the two methods (H~ l or G _1 ). Only two coverage 
rates were significantly different from the nominal values 95% (with the AR 
model and the Marquardt algorithm). 

6 CONCLUSION 

We have studied a Newton-like method for maximizing the likelihood which 
needs only the computation of the scores and does not need the compu- 
tation of the Hessian. A simulation study of random effects models with 
left censored observations confirms that the RVS algorithm is faster than 
the Marquardt algorithm and this happens because the number of iterations 
needed by the RVS algorithm is not much larger. The advantage of the RVS 
algorithm over methods needing computation of the Hessian increases with 
the number of parameters. In fact the Levenberg-Marquardt idea of increas- 
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ing the diagonal of the matrix for improving its condition number could be 
applied to the RVS algorithm. The simulation study showed that the cover- 
age rate using the matrix G was good. We have also studied a convergence 
criterion for stopping the iterative process which has a statistical interpreta- 
tion leading to an easy choice of the stopping value. Further work is needed 
for studying inference in misspecified models without computing the Hessian 
and this could be developed along the lines of section 4.2. The RVS is the 
algorithm of choice for maximum likelihood inference in complex problems 
when the Hessian of the loglikelihood is difficult to compute and when the 
number of parameters is large. It can be applied in particular for estimating 
parameters in models specified by systems of non-linear differential equations 
(Guedj et aL, 2007). 
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APPENDIX: STOPPING BEFORE FULL CON- 
VERGENCE 

In some difficult problems it may save time to stop before full convergence, 
at least in an exploratory phase. We compute U — U + e where e is a nu- 
merical error. The matrix G is also affected by numerical error but there is 
a specific problem with U because the norm of U decreases near the maxi- 
mum, so that the relative error increases. The computed search direction is 
G-\9 k )U(9 k ) = G~ 1 (9 k )U(9 k ) + G- X {9 h )e k . The G-norm of the first term 
is U{d k ) T G- 1 {d k )U{d k ) « d(6 k ,9) and for the second term it is e T G- x (d k )e, 
so that the relative error in the search direction is r e = £ %- ffi^ £ which in- 

d(0 k ,9) 

creases as 9 k approaches 9. In particular when r e w 1 we do not have one 
exact significant digit so that further progression toward 9 becomes impos- 
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sible. This means that in term of stopping value it will be nearly impossible 
to go below c = £ ° m ^ £ - m difficult problems (where the computation 
of the likelihoods involves high-order multiple integrals) it may very time- 
consuming to obtain (by increasing the precision of the computation of the 
integral) a value c below the value that we would have wished, for instance 

io- 2 . 

It may be interesting to stop at a value c such that the approximation error 
is smaller than the statistical error but can not be considered as negligible, 
for instance c = 0.5. Let us denote 9 a parameter value satisfying the conver- 
gence test with the stopping value c; 9 is not far from 9 and in fact, considered 
as an estimator of 9 it shares its consistency property. Moreover we can con- 
struct a conservative confidence region for 9 based on 9. An asymptotic (1— a) 
confidence region for 9 based on 9 can be defined as R\- a = {9 : d(9, 9) < c a } 
where d(9,9) = U{9) T G- 1 {9)U{9) and c a is the (1 - a) quantile of the \ 2 m 
distribution (Knight, 2000). Consider now the confidence region centered on 
9: R^ a = {9 : d(9, 9) < c a +d(9, 9)}. We have that R^ a C Ri- a . The result 
can be proved by using the triangular inequality d(9, 9) < d(9, 9) + d(9, 9) to 
show that d(9, 9) < c a + d(9, 9)} ==>• d(9, 9) < c a }. Thus Ri- a is a conserva- 
tive (1 — a) confidence region for 9. Confidence regions do not give directly 
usual confidence intervals for each component of 9 (because confidence re- 
gions take multiplicity into account). However we go from R\- a to R\~ a by 
inflating G" 1 by the factor (c a + d(9,9))/c a ; it is tempting to do the same 
for confidence intervals, at least in an exploratory phase. 
As an example a 0.95 confidence ellipsoid for m = 10 is obtained with c a = 
18.3. If we stop at = 0.4 we have d(9, 9) w 4; so the conservative confidence 
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ellipsoid is -R0.95 — {@ '■ d(6,6) < 22.31} and the inflating factor is 1.22; 
applying this to confidence intervals this would lead to intervals inflated by 
Vl.22 = 1.1. 
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Table 1 : Convergence statistics from the simulation study comparing the 
Marquardt and RVS algorithms for findin the the MLE in a mixed model 
from left-censored longitudinal data. 



Marquardt RVS 



Model AR 
threshold =1.0 

# convergence reached 97 97 
Mean convergence time 47.7 21.0 
Mean iterations number 6.2 9.6 

threshold =2.0 

# convergence reached 83 97 
Mean convergence time 206.3 71.8 
Mean iterations number 9.6 11.5 
Model RE 

threshold =1.0 

# convergence reached 95 98 
Mean convergence time 19.6 9.1 
Mean iterations number 7.3 11.4 

threshold =2.0 

# convergence reached 86 97 
Mean convergence time 128.1 43.8 
Mean iterations number 11.3 13.1 
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Table 2 : Asymptotic and sample variances of the estimates and coverage 
rates of the 95% confidence intervals of the maximum likelihood estimators 
obtained with Marquardt and RVS algorithms when the censoring 

threshold was 2.0. 



Marquardt RVS 





Asymptotic 


Sample 


Coverage 


Asymptotic 


Sample 


Coverage 


parameter 


variance 


variance 


rate 


variance 


variance 


rate 


Model AR 
















0.036 


0.040 


91.6 


0.038 


0.040 


91.8 


01 


0.0021 


0.0027 


89.2 


0.0024 


0.0024 


91.8 


02 


0.072 


0.096 


91.6 


0.079 


0.091 


93.8 


A 


0.0048 


0.0056 


92.8 


0.0054 


0.0050 


94.9 




0.029 


0.036 


88.0 


0.036 


0.034 


92.8 


s 


0.0027 


0.0027 


90.4 


0.0031 


0.0025 


94.9 




0.0094 


0.0095 


95.2 


0.010 


0.009 


96.9 


Model RE 














00 


0.018 


0.016 


94.2 


0.020 


0.018 


94.9 


01 


0.0036 


0.0035 


95.4 


0.0041 


0.0034 


97.9 


02 


0.037 


0.037 


97.7 


0.042 


0.039 


99.0 


03 


0.0077 


0.0068 


96.5 


0.0088 


0.0067 


99.0 


-I 


0.016 


0.017 


90.7 


0.021 


0.018 


94.9 


001 


0.0026 


0.0026 


91.9 


0.0034 


0.0030 


93.8 


°l 


0.0008 


0.0008 


91.9 


0.0011 


0.0008 


94.9 


°l 


0.0065 


0.0070 


95.3 


0.0074 


0.0064 


95.9 
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